This is a document detailing how to process mapped NGS reads (sorted and indexed .bam files), and derive the SFS, site-wise estimates of Theta and Tajima’s D using ANGSD. This is based on the wiki entry here. At the moment this is estiamted based on several teosinte samples from the HapMap2 project. These are stored on Farm in:
The examples TIL*.bam are all teosinte, but TIL8 and TIL25 are Z. mays mexicana and will be removed from future analysis. The remiaing TIL*.bam accessions are all Zea mays parviglumis and can be analysed together.
In order to process the data the order of opporations should be roughly:
Below is an example of work flow given in the wiki, enabling the eventual estimation of Tajima’s D.
./angsd -bam bam.filelist -doSaf 1 -anc chimpHg19.fa -GL 2 -P 24 -out out
./misc/realSFS out.saf 20 -P 24 > out.sfs
./angsd -bam bam.filelist -out out -doThetas 1 -doSaf 1 -pest out.sfs -anc chimpHg19.fa -GL 2
./misc/thetaStat make_bed out.thetas.gz
#Estimate for every Chromosome/scaffold
./misc/thetaStat do_stat out.thetas.gz -nChr 20
#Do a sliding window analysis based on the output from the make_bed command.
./misc/thetaStat do_stat out.thetas.gz -nChr 20 -win 50000 -step 10000 -outnames theta.thetasWindow.gz
First estimate the site allele frequency likelihood. This requires several things listed below:
1. A file with listed, one per line, all the .bam files you want to analyse. You can grab the files you need by cd’ing to the dir they are in and executing this code on the command line.
ls $PWD/*.bam > bam.list
2. Choose which method you want to use with:
-doSaf [int 1-4]
There are four options listed in detail here.
3. Define your ancestral allele using the flag:
-anc <path/to/referencegenome>
In my case we do not know the ancestral allele state, which means instead of derived allele SFS we need a minor allele SFS (a folded SFS). We can still provide an ancestral estimate using the reference genome (B73), but once folding is complete in becomes a minor allele SFS. We need to specify that we want a folded SFS wiht:
-fold 1
4. Define the method for estimating Genotype Likelihoods:
-GL [int 1-4]
details of the different methods are provided here.
5. Define the number of processors to use with:
-P [int]
6. Define the outfile name using:
-out <path/to/outfile>
there are several output files and a suffix will be added to each file.
You can also define the region of the genome you would like to analyse. On the first pass we will limit our analysis of the genome to the first 500,000 bp of chromosome 10 using:
-r 10:1-100000
I have a bam.list file that has the following accessions listed:
/home/sbyfield/HapMap2Teo/TIL01_merged.bam
/home/sbyfield/HapMap2Teo/TIL02_merged.bam
/home/sbyfield/HapMap2Teo/TIL03_merged.bam
/home/sbyfield/HapMap2Teo/TIL04-TIP454_merged.bam
/home/sbyfield/HapMap2Teo/TIL05_merged.bam
/home/sbyfield/HapMap2Teo/TIL06-TIP496_merged.bam
/home/sbyfield/HapMap2Teo/TIL07_merged.bam
/home/sbyfield/HapMap2Teo/TIL09_merged.bam
/home/sbyfield/HapMap2Teo/TIL10_merged.bam
/home/sbyfield/HapMap2Teo/TIL11_merged.bam
/home/sbyfield/HapMap2Teo/TIL12_merged.bam
/home/sbyfield/HapMap2Teo/TIL14-TIP498_merged.bam
/home/sbyfield/HapMap2Teo/TIL15_merged.bam
/home/sbyfield/HapMap2Teo/TIL16_merged.bam
/home/sbyfield/HapMap2Teo/TIL17_merged.bam
So I can run the following command to get the initial estimation of site allele frequency likelihood (SAF):
angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -doSaf 1 -out teoFolded_chr10 -anc /home/sbyfield/HapMap2Teo/Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -fold 1 -P 12 -r 10:1-10000000
I extedned the region from 1:100,000 bp to 1:500,000 bp resulted in the follwoing error:
-> Reading fasta: /home/sbyfield/HapMap2Teo/Zea_mays.AGPv3.22.dna.genome.fa
-> Parsing 15 number of samples
-> Printing at chr: 10 pos:5698 chunknumber 100
-> Printing at chr: 10 pos:7154 chunknumber 200
-> Printing at chr: 10 pos:26725 chunknumber 300
-> Printing at chr: 10 pos:29944 chunknumber 400
-> Printing at chr: 10 pos:41538 chunknumber 500
-> Printing at chr: 10 pos:44283 chunknumber 600
-> Printing at chr: 10 pos:46373 chunknumber 700
-> Printing at chr: 10 pos:51852 chunknumber 800
-> Printing at chr: 10 pos:51952 chunknumber 900
PROBS at: 10 52032
-> Printing at chr: 10 pos:70089 chunknumber 1000
-> Printing at chr: 10 pos:70974 chunknumber 1100
-> Printing at chr: 10 pos:123765 chunknumber 1200
PROBS at: 10 143508
-> Printing at chr: 10 pos:143560 chunknumber 1300
-> Printing at chr: 10 pos:144692 chunknumber 1400
PROBS at: 10 146774
-> Printing at chr: 10 pos:147193 chunknumber 1500
-> Printing at chr: 10 pos:176348 chunknumber 1600
-> Printing at chr: 10 pos:176962 chunknumber 1700
-> Printing at chr: 10 pos:182529 chunknumber 1800
-> Printing at chr: 10 pos:207307 chunknumber 1900
-> Printing at chr: 10 pos:266486 chunknumber 2000
-> Printing at chr: 10 pos:292877 chunknumber 2100
-> Printing at chr: 10 pos:317113 chunknumber 2200
-> Printing at chr: 10 pos:372899 chunknumber 2300
-> Printing at chr: 10 pos:443060 chunknumber 2400
-> Printing at chr: 10 pos:475084 chunknumber 2500
-> Done reading data waiting for calculations to finish
-> Done waiting for threads
-> Output filenames:
->"smallFolded_10_100000.arg"
->"smallFolded_10_100000.saf"
->"smallFolded_10_100000.saf.pos.gz"
-> Tue Oct 28 15:07:25 2014
-> Arguments and parameters for all analysis are located in .arg file
[ALL done] cpu-time used = 43.92 sec
[ALL done] walltime used = 25.00 sec
real 0m0.000s
user 0m0.000s
sys 0m0.000s
I examined extending the region to 1:1,000,000 bp with:
angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -doSaf 1 -out teoFolded_chr10 -anc /home/sbyfield/HapMap2Teo/Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -fold 1 -P 12 -r 10:1-1000000 -minMapQ 1 -minQ 20
and got this error in the stderr log file:
-> Reading fasta: /home/sbyfield/HapMap2Teo/Zea_mays.AGPv3.22.dna.genome.fa
-> Parsing 15 number of samples
-> Printing at chr: 10 pos:5698 chunknumber 100
-> Printing at chr: 10 pos:7154 chunknumber 200
-> Printing at chr: 10 pos:26725 chunknumber 300
-> Printing at chr: 10 pos:29944 chunknumber 400
-> Printing at chr: 10 pos:41538 chunknumber 500
-> Printing at chr: 10 pos:44283 chunknumber 600
-> Printing at chr: 10 pos:46373 chunknumber 700
-> Printing at chr: 10 pos:51852 chunknumber 800
-> Printing at chr: 10 pos:51952 chunknumber 900
PROBS at: 10 52032
-> Printing at chr: 10 pos:70089 chunknumber 1000
-> Printing at chr: 10 pos:70974 chunknumber 1100
-> Printing at chr: 10 pos:123765 chunknumber 1200
PROBS at: 10 143508
-> Printing at chr: 10 pos:143560 chunknumber 1300
-> Printing at chr: 10 pos:144692 chunknumber 1400
PROBS at: 10 146774
-> Printing at chr: 10 pos:147193 chunknumber 1500
-> Printing at chr: 10 pos:176348 chunknumber 1600
-> Printing at chr: 10 pos:176962 chunknumber 1700
-> Printing at chr: 10 pos:182529 chunknumber 1800
-> Printing at chr: 10 pos:207307 chunknumber 1900
-> Printing at chr: 10 pos:266486 chunknumber 2000
-> Printing at chr: 10 pos:292877 chunknumber 2100
-> Printing at chr: 10 pos:317113 chunknumber 2200
-> Printing at chr: 10 pos:372899 chunknumber 2300
-> Printing at chr: 10 pos:443060 chunknumber 2400
-> Printing at chr: 10 pos:475084 chunknumber 2500
-> Printing at chr: 10 pos:537668 chunknumber 2600
-> Printing at chr: 10 pos:549090 chunknumber 2700
-> Printing at chr: 10 pos:574173 chunknumber 2800
-> Printing at chr: 10 pos:589414 chunknumber 2900
-> Printing at chr: 10 pos:615307 chunknumber 3000
-> Printing at chr: 10 pos:633719 chunknumber 3100
-> Printing at chr: 10 pos:651276 chunknumber 3200
-> Printing at chr: 10 pos:667980 chunknumber 3300
-> Printing at chr: 10 pos:687278 chunknumber 3400
-> Printing at chr: 10 pos:703154 chunknumber 3500
-> Printing at chr: 10 pos:730360 chunknumber 3600
-> Printing at chr: 10 pos:750631 chunknumber 3700
-> Printing at chr: 10 pos:770970 chunknumber 3800
-> Printing at chr: 10 pos:789476 chunknumber 3900
-> Printing at chr: 10 pos:808868 chunknumber 4000
-> Printing at chr: 10 pos:829136 chunknumber 4100
-> Printing at chr: 10 pos:861912 chunknumber 4200
-> Printing at chr: 10 pos:888113 chunknumber 4300
-> Printing at chr: 10 pos:911758 chunknumber 4400
-> Printing at chr: 10 pos:935262 chunknumber 4500
-> Printing at chr: 10 pos:958633 chunknumber 4600
-> Printing at chr: 10 pos:974523 chunknumber 4700
-> Done reading data waiting for calculations to finish
-> Done waiting for threads
-> Output filenames:
->"smallFolded_10_100000.arg"
->"smallFolded_10_100000.saf"
->"smallFolded_10_100000.saf.pos.gz"
-> Tue Oct 28 15:16:32 2014
-> Arguments and parameters for all analysis are located in .arg file
[ALL done] cpu-time used = 82.42 sec
[ALL done] walltime used = 38.00 sec
real 0m0.000s
user 0m0.000s
I extedned the region to 10,000,000 bp. Next it is probably wise to go ahead an see if I can move through the rest of the analyses and actually producce a folded SFS.
The next step is to convert the .saf file to a SFS using:
realSFS
There there is some confusion in the wiki for ANGSD about exactly how to do this, sometimes it says you need to declare \(2n+1\) chromosomes, sometimes \(2n\) chromosomes and sometimes \(n\), where \(n\) is the number of samples. I have found that using \(n\) gets ANGSD to run. For example, if I have 15 samples (as I do in this example) I will declare 15 chromosomes as an argument to realSFS:
realSFS teoFolded_chr10.saf 15 -maxIter 100 -P 12 > teoFolded_chr10.sfs
Here is the resulting SFS for the first 10,000,000 bp of chromosome 10 for 15 teosinte lines:
sfs<-c(-0.192947, -2.444628, -3.249441, -4.179364, -4.586453, -5.137299 ,-5.489250, -5.845022, -6.020604, -6.241799, -6.441396, -6.546019, -6.785617, -6.850590, -6.869085 ,-7.022943)
barplot(exp(sfs[-1]),col="darkgrey", names.arg=1:(length(sfs)-1), ylab="probability",xlab="minor allele freqquency", main = "Chr10:1:10,000,000")
This worked well, so I attempted the whole thing again for the entire Chr10 using:
>angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -doSaf 1 -out teoFolded_chr10 -anc /home/sbyfield/HapMap2Teo/Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -fold 1 -r 10 -P 12 -minMapQ 1 -minQ 20
>realSFS teoFolded_chr10.saf 15 -maxIter 100 -P 12 > teoFolded_chr10.sfs
sfs<-c(-0.206044, -2.398034, -3.210098, -4.072419, -4.525486, -5.026914, -5.395742, -5.723438, -5.943467, -6.157773, -6.327133, -6.439034, -6.542996, -6.667280, -6.892473 -7.044611)
barplot(exp(sfs[-1]),col="darkgrey", names.arg=1:(length(sfs)-1), ylab="probability",xlab="minor allele freqquency", main = "Chr10")
This worked well also so I extended it to the whole genome using:
>angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -doSaf 1 -out teoFolded -anc /home/sbyfield/HapMap2Teo/Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -fold 1 -P 18 -minMapQ 1 -minQ 20
>realSFS teoFolded.saf 15 -maxIter 100 -P 18 > teoFolded.sfs
The next step is too calculate site wise estimates of theta for the region of interest. This can be acheived with a command like this:
>angsd -bam bam.filelist -out out -doThetas 1 -doSaf 1 -pest out.sfs -anc chimpHg19.fa -GL 2
So for jsut the Chr10 data my command looks like:
>angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -out teoThetas_ch10 -doThetas 1 -doSaf 1 -fold 1 -pest teoFolded_chr10.sfs -anc Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -P 12 -r 10:1-10000000 -minMapQ 1 -minQ 20
So to take a look at the data you can do:
gunzip -c teoThetas_ch10.thetas.gz | head
The ouput looks like this..
#Chromo Pos Watterson Pairwise thetaSingleton thetaH thetaL
10 19 -3.057582 -3.592928 -Inf -Inf -Inf
10 20 -3.057582 -3.592928 -Inf -Inf -Inf
10 21 -3.057582 -3.592928 -Inf -Inf -Inf
10 22 -3.057582 -3.592928 -Inf -Inf -Inf
10 23 -3.057582 -3.592928 -Inf -Inf -Inf
10 24 -3.057582 -3.592928 -Inf -Inf -Inf
10 25 -3.057582 -3.592928 -Inf -Inf -Inf
10 26 -3.057582 -3.592928 -Inf -Inf -Inf
10 27 -3.057582 -3.592928 -Inf -Inf -Inf
Note that the last three columns cannot be calculated with a folded spectrum. It’s not important for calculating Tajima’s D in this case. The next step is to create a BED file of the genome region.
#create a binary version of thete.thetas.gz
misc/thetaStat make_bed teoThetas_ch10.thetas.gz
#calculate Tajimas D
./misc/thetaStat do_stat out.thetas.gz -nChr 20 -win 5000 -step 1000 -outnames teothetasWindow_chr10.gz
The outout of the call looks like this:
#(indexStart,indexStop)(firstPos_withData,lastPos_withData)(WinStart,WinStop) Chr WinCenter tW tP tF tH tL Tajima fuf fud fayh zeng nSites
(981,5553)(1000,6000)(1000,6000) 10 3500 229.075532 128.679660 0.000000 0.000000 0.000000 -1.951619 1.022695 2.198914 0.627387 -1.271722 4572
(1744,6553)(2000,7000)(2000,7000) 10 4500 252.158774 146.216524 0.000000 0.000000 0.000000 -1.871340 1.056264 2.200253 0.647711 -1.272027 4809
(2737,7373)(3000,9633)(3000,8000) 10 5500 245.769776 142.231370 0.000000 0.000000 0.000000 -1.876311 1.054038 2.199907 0.646415 -1.271948 4636
(3737,7373)(4000,9633)(4000,9000) 10 6500 198.005228 116.321264 0.000000 0.000000 0.000000 -1.836315 1.068549 2.196625 0.655984 -1.271199 3636
(4553,7740)(5000,10000)(5000,10000) 10 7500 160.708626 92.637354 0.000000 0.000000 0.000000 -1.884167 1.046818 2.192724 0.643425 -1.270307 3187
(5553,8515)(6000,12805)(6000,11000) 10 8500 135.715894 77.997890 0.000000 0.000000 0.000000 -1.890558 1.042092 2.188930 0.641280 -1.269437 2962
(6553,8515)(7000,12805)(7000,12000) 10 9500 84.359327 46.535357 0.000000 0.000000 0.000000 -1.988064 0.994260 2.174266 0.614661 -1.266046 1962
(7373,8695)(9633,13000)(8000,13000) 10 10500 62.033024 37.406809 0.000000 0.000000 0.000000 -1.755972 1.080761 2.160589 0.671020 -1.262847 1322
Here is description of the file format form the ANGSD wiki:
"The .pestPG file is a 14 column file (tab seperated). The first column contains information about the region. The second and third column is the reference name and the center of the window.
We then have 5 different estimators of theta, these are: Watterson, pairwise, FuLi, fayH, L. And we have 5 different neutrality test statistics: Tajima's D, Fu&Li F's, Fu&Li's D, Fay's H, Zeng's E. The final column is the effetive number of sites with data in the window."
Make an attempt to draw estimates of Tajima’s D across the whole of Chr10:
#load in the data
TJD<-read.table("/Users/simonrenny-byfield/test_angst/teothetasWindow_chr10.gz.pestPG", header = T, sep = "\t")
plot(TJD[,3],TJD[,9], pch = 16, cex=.5)
Now examine Tajima’s D with proportion of bases covered per window.
library(ggplot2)
#load in the data
TJD<-read.table("/Users/simonrenny-byfield/test_angst/teothetasWindow_chr10.gz.pestPG", header = T, sep = "\t")
plot(TJD[,14]/5000,TJD[,9], pch = 16, cex=.5)
#make a data table to plot with ggplot
TajimaD<-data.frame("pos"=TJD[,3],"prop"=TJD[,3],"TD"=TJD[,9])
plt <- ggplot(TajimaD, aes(x= prop,y=TD))+
geom_point()+
stat_summary(geom="ribbon", fun.ymin="mean", fun.ymax="mean")
plt
I spoke with Jeff and he indicated that the parameters are probably too liberal, resultingin bad mapping and an over-estimation of rare alleles (hence negative TD) . For example, he tells me that simulations indicated that BWA-MEM maps better alignments with ninimum aligment score of 40. So, I will re-run the analysis with modified parameters detailed below:
#the minimum number of indivduals with at least 1 read.
-minInd [int]
#The m
-minMapQ 20
-minQ 40
I will re-run analyses with the first 10,000,000 bp of chromosome 10 using the following command:
>angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -doSaf 1 -out teoFolded_chr10_10000000 -anc /home/sbyfield/HapMap2Teo/Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -fold 1 -P 12 -r 10:1-1000000 -minMapQ 20 -minQ 40
>realSFS teoFolded_chr10_10000000.saf 15 -maxIter 100 -P 12 > teoFolded_chr10_10000000.sfs
using the following slurm batch file:
#!/bin/bash
#OUTDIR=/home/sbyfield/HapMap2Teo
#SBATCH -D /home/sbyfield/HapMap2Teo/
#SBATCH -o /home/sbyfield/HapMap2Teo/slurm-log/stdout_log-%j.txt
#SBATCH -e /home/sbyfield/HapMap2Teo/slurm-log/stderr_log-%j.txt
#SBATCH -J teo
#SBATCH --cpus-per-task=18
set -e
set -u
echo "Starting job.."
echo ""
#for the sfs defining, in this case, a false ancestral allele (i.e. the B73 genome)
echo "angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -doSaf 1 -out teoFolded_ch10_10000000 -anc /home/sbyfield/HapMap2Teo/Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -fold 1 -P 18 -minMapQ 20 -minQ 40 -r 10:1-10000000"
angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -doSaf 1 -out teoFolded_ch10_10000000 -anc /home/sbyfield/HapMap2Teo/Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -fold 1 -P 18 -minMapQ 20 -minQ 40 -r 10:1-10000000
#make it into a folded SFS, or minor allele spectrum
echo "realSFS teoFolded_ch10_10000000.saf 15 -maxIter 100 -P 18 > teoFolded_ch10_10000000.sfs"
realSFS teoFolded_ch10_10000000.saf 15 -maxIter 100 -P 18 > teoFolded_ch10_10000000.sfs
echo ""
time
echo ""
echo "done"
the JOB ID is: 1480928
Following this I used ANGSD to estimate theta:
angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -out teoThetas_ch10_10000000 -doThetas 1 -doSaf 1 -fold 1 -pest teoFolded_chr10_100000.sfs -anc Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -P 12 -r 10:1-10000000
Using this file:
#!/bin/bash
#OUTDIR=/home/sbyfield/HapMap2Teo
#SBATCH -D /home/sbyfield/HapMap2Teo/
#SBATCH -o /home/sbyfield/HapMap2Teo/slurm-log/stdout_log-%j.txt
#SBATCH -e /home/sbyfield/HapMap2Teo/slurm-log/stderr_log-%j.txt
#SBATCH -J teo
#SBATCH --cpus-per-task=12
set -e
set -u
echo "Starting job.."
time
echo ""
angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -out teoThetas_ch10_10000000 -doThetas 1 -doSaf 1 -fold 1 -pest teoFolded_ch10_10000000.sfs -anc Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -P 12 -r 10:1-10000000 -minMapQ 20 -minQ 40
echo ""
time
echo ""
echo "done"
ID = 1483211
Then I can estimate TajD over sliding windows using:
#!/bin/bash
#OUTDIR=/home/sbyfield/HapMap2Teo
#SBATCH -D /home/sbyfield/HapMap2Teo/
#SBATCH -o /home/sbyfield/HapMap2Teo/slurm-log/stdout_log-%j.txt
#SBATCH -e /home/sbyfield/HapMap2Teo/slurm-log/stderr_log-%j.txt
#SBATCH -J teo
#SBATCH --cpus-per-task=12
set -e
set -u
echo ""
time
echo "making BED"
thetaStat make_bed teoThetas_ch10_10000000.thetas.gz
thetaStat do_stat teoThetas_ch10_10000000.thetas.gz -nChr 15 -win 5000 -step 1000 -outnames teothetasWindow_chr10_10000000.gz
echo "done"
time
The output look like this:
## thetaStat VERSION: 0.01 build:(Oct 23 2014,15:39:05)
#(indexStart,indexStop)(firstPos_withData,lastPos_withData)(WinStart,WinStop) Chr WinCenter tW tP tF tH tL Tajima fuf fud fayh zeng nSites
(71,388)(300,1530)(300,800) 10 550 10.460646 6.107603 0.000000 0.000000 0.000000 -1.762926 0.951768 1.945277 0.634197 -1.207356 317
(137,388)(400,1530)(400,900) 10 650 8.275646 4.823003 0.000000 0.000000 0.000000 -1.744697 0.924989 1.889293 0.628344 -1.191207 251
(237,388)(500,1530)(500,1000) 10 750 4.888378 2.850494 0.000000 0.000000 0.000000 -1.673753 0.854698 1.733225 0.613838 -1.141940 151
(296,388)(611,1530)(600,1100) 10 850 3.259762 2.020438 0.000000 0.000000 0.000000 -1.458366 0.836464 1.586624 0.634632 -1.089353 92
(346,388)(700,1530)(700,1200) 10 950 1.506088 0.941821 0.000000 0.000000 0.000000 -1.258182 0.681794 1.267787 0.587341 -0.950147 42
(388,388)(1530,1530)(800,1300) 10 1050 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -nan -nan -nan -nan 0
(388,388)(1530,1530)(900,1400) 10 1150 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -nan -nan -nan -nan 0
(388,388)(1530,1530)(1000,1500) 10 1250 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -nan -nan -nan -nan 0
library(ggplot2)
#load in the data
TJD<-read.table("/Users/simonrenny-byfield/test_angst/teothetasWindow_chr10_10000000.gz.pestPG", header = T, sep = "\t")
plot(TJD[,14]/5000,TJD[,9], pch = 16, cex=.5, xlab="proportion of bases covered", ylab = "Tajima's D")
plot(TJD[,3]/(10^6),TJD[,9], pch = 16, cex=.5, xlab = "position (Mbp)",ylab = "Tajima's D")
#make a data table to plot with ggplot
#TajimaD<-data.frame("pos"=TJD[,3],"prop"=TJD[,3],"TD"=TJD[,9])
#plt <- ggplot(TajimaD, aes(x= prop,y=TD))+
# geom_point()+
# stat_summary(geom="ribbon", fun.ymin="mean", fun.ymax="mean")
#plt
The mapping quality seems to be an issue and also I need to limit the number of sites to only those where -minInd == 12 of the 15 samples. Note that:
#!/bin/bash
#OUTDIR=/home/sbyfield/HapMap2Teo
#SBATCH -D /home/sbyfield/HapMap2Teo/
#SBATCH -o /home/sbyfield/HapMap2Teo/slurm-log/stdout_log-%j.txt
#SBATCH -e /home/sbyfield/HapMap2Teo/slurm-log/stderr_log-%j.txt
#SBATCH -J teo
#SBATCH --cpus-per-task=18
set -e
set -u
echo "Starting job.."
echo ""
#for the sfs defining, in this case, a false ancestral allele (i.e. the B73 genome)
echo "angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -doSaf 1 -out teoFolded_chr10_10M_minInd12 -anc /home/sbyfield/HapMap2Teo/Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -r 10:1-10000000 -fold 1 -P 18 -minMapQ 40 -minQ 40 -minInd 12"
angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -doSaf 1 -out teoFolded_chr10_10M_minInd12 -anc /home/sbyfield/HapMap2Teo/Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -r 10:1-10000000 -fold 1 -P 18 -minMapQ 40 -minQ 40 -minInd 12
#make it into a folded SFS, or minor allele spectrum
echo "realSFS teoFolded_chr10_10M_minInd12.saf 15 -maxIter 100 -P 18 > teoFolded_chr10_10M_minInd12.sfs"
realSFS teoFolded_chr10_10M_minInd12.saf 15 -maxIter 100 -P 18 > teoFolded_chr10_10M_minInd12.sfs
echo ""
time
echo ""
echo "done"
Next step is to calculate thetas over the genomic region of interest using:
#!/bin/bash
#OUTDIR=/home/sbyfield/HapMap2Teo
#SBATCH -D /home/sbyfield/HapMap2Teo/
#SBATCH -o /home/sbyfield/HapMap2Teo/slurm-log/stdout_log-%j.txt
#SBATCH -e /home/sbyfield/HapMap2Teo/slurm-log/stderr_log-%j.txt
#SBATCH -J teo
#SBATCH --cpus-per-task=12
set -e
set -u
echo "Starting job.."
time
echo ""
>angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -out teoFolded_chr10_10M_minInd12.sfs -doThetas 1 -doSaf 1 -fold 1 -pest teoFolded_chr10_10M_minInd12.sfs -anc Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -P 12 -r 10:1-10000000 -minMapQ 40 -minQ 40 minInd 12
echo ""
time
echo ""
echo "done"
Now to scan the genome by window, so we can plot the data using:
#!/bin/bash
#OUTDIR=/home/sbyfield/HapMap2Teo
#SBATCH -D /home/sbyfield/HapMap2Teo/
#SBATCH -o /home/sbyfield/HapMap2Teo/slurm-log/stdout_log-%j.txt
#SBATCH -e /home/sbyfield/HapMap2Teo/slurm-log/stderr_log-%j.txt
#SBATCH -J teo
#SBATCH --cpus-per-task=12
set -e
set -u
echo ""
time
echo "making BED"
thetaStat make_bed teoFolded_chr10_10M_minInd12.sfs.thetas.gz
thetaStat do_stat teoFolded_chr10_10M_minInd12.sfs.thetas.gz -nChr 15 -win 5000 -step 1000 -outnames teoFolded_chr10_10M_minInd12
echo "done"
time
Jeff notes that these line (from HapMap2) are all selfed for many generations and so need the expected heterozygosity will change. These samples are Inbred. You can indicate this to ANGSD with an inbreeding file* containing the appropriate inbreeding coefficient for each sample. In this case 1.
#!/bin/bash
#OUTDIR=/home/sbyfield/HapMap2Teo
#SBATCH -D /home/sbyfield/HapMap2Teo/
#SBATCH -o /home/sbyfield/HapMap2Teo/slurm-log/stdout_log-%j.txt
#SBATCH -e /home/sbyfield/HapMap2Teo/slurm-log/stderr_log-%j.txt
#SBATCH -J teo
#SBATCH --cpus-per-task=18
#!/bin/bash
#OUTDIR=/home/sbyfield/HapMap2Teo
#SBATCH -D /home/sbyfield/HapMap2Teo/
#SBATCH -o /home/sbyfield/HapMap2Teo/slurm-log/stdout_log-%j.txt
#SBATCH -e /home/sbyfield/HapMap2Teo/slurm-log/stderr_log-%j.txt
#SBATCH -J teo
#SBATCH --cpus-per-task=18
#angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -doSaf 1 -out teoFolded_chr10_100M_minInd15 -anc /home/sbyfield/HapMap2Teo/Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -r 10:1-100000000 -fold 1 -P 18 -minMapQ 20 -minQ 40 -minInd 15 -indF InbreedingCoEff.txt
realSFS teoFolded_chr10_100M_minInd15.saf 15 -maxIter 100 -P 18 > teoFolded_chr10_100M_minInd15.sfs
angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -out teoFolded_chr10_100M_minInd15 -doThetas 1 -doSaf 1 -fold 1 -pest teoFolded_chr10_100M_minInd15.sfs -anc Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -P 12 -r 10:1-100000000 -minMapQ 30 -minQ 40 minInd 15 -indF InbreedingCoEff.txt
thetaStat make_bed teoFolded_chr10_100M_minInd15.sfs.thetas.gz
thetaStat do_stat teoFolded_chr10_100M_minInd15.sfs.thetas.gz -nChr 15 -win 5000 -step 1000 -outnames teoFolded_chr10_100M_minInd15
echo "done"
time
library(ggplot2)
#load in the data
TJD<-read.table("/Users/simonrenny-byfield/test_angst/teoFolded_chr10_100M_minInd15.pestPG", header = T, sep = "\t")
plot(TJD[,14]/5000,TJD[,9], pch = 16, cex=.5, xlab="proportion of bases covered", ylab = "Tajima's D")
scatter.smooth(x=TJD[,14]/5000, y=TJD[,9], cex = 0.01,col = "red", lwd = 3, lty = 3, span = 0.3, family = "gaussian")
plot(TJD[,14]/5000,(TJD[,5]/TJD[,14]), pch = 16, cex=.5, xlab="proportion of bases covered", ylab = "pi")
scatter.smooth(x=TJD[,14]/5000, y=(TJD[,5]/TJD[,14]), cex = 0.01,col = "red", lwd = 3, lty = 3, span = 0.3, family = "gaussian")
plot(TJD[,3]/(10^6),TJD[,9], pch = 16, cex=.5, xlab = "position (Mbp)",ylab = "Tajima's D")
hist(TJD[,9][TJD[,9] != 0], col = "cornflowerblue", xlab="Tajima's D")
#make a data table to plot with ggplot
#TajimaD<-data.frame("pos"=TJD[,3],"prop"=TJD[,3],"TD"=TJD[,9])
#plt <- ggplot(TajimaD, aes(x= prop,y=TD))+
# geom_point()+
# stat_summary(geom="ribbon", fun.ymin="mean", fun.ymax="mean")
#plt
Retry the analysis with a few parameter changes:
most importantly I was changed the -doSaf paramter to 2 (-doSaf 2), which DOES NOT assume HWE like the -doSaf 1 option DOES.
#!/bin/bash
#OUTDIR=/home/sbyfield/HapMap2Teo
#SBATCH -D /home/sbyfield/HapMap2Teo/
#SBATCH -o /home/sbyfield/HapMap2Teo/slurm-log/stdout_log-%j.txt
#SBATCH -e /home/sbyfield/HapMap2Teo/slurm-log/stderr_log-%j.txt
#SBATCH -J teo
#SBATCH --cpus-per-task=18
#angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -doMaf 1 -uniqueOnly 0 -doMajorMinor 1 -doSaf 2 -out teoFolded_chr10_100M_minInd15 -anc /home/sbyfield/HapMap2Teo/Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -r 10:1-100000000 -fold 1 -P 18 -minMapQ 30 -minQ 20 -minInd 15 -indF InbreedingCoEff.txt
#realSFS teoFolded_chr10_100M_minInd15.saf 15 -maxIter 100 -P 18 > teoFolded_chr10_100M_minInd15.sfs
angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -out teoFolded_chr10_100M_minInd15.sfs -doThetas 1 -doSaf 2 -fold 1 -pest teoFolded_chr10_100M_minInd15.sfs -anc Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -P 12 -r 10:1-100000000 -minMapQ 30 -minQ 20 -indF InbreedingCoEff.txt -doMajorMinor 1 -doMaf 1
thetaStat make_bed teoFolded_chr10_100M_minInd15.sfs.thetas.gz
thetaStat do_stat teoFolded_chr10_100M_minInd15.sfs.thetas.gz -nChr 15 -win 5000 -step 1000 -outnames teoFolded_chr10_100M_minInd15
echo "done"
time
I removed the minimum coverage (of individuals, i.e. minInd). Once the analysis is run we can then parse the data according to how many individuals are represented for any given position. I set antother job running for the whole genome using this command (in a slurm batch file):
#!/bin/bash
#OUTDIR=/home/sbyfield/HapMap2Teo
#SBATCH -D /home/sbyfield/HapMap2Teo/
#SBATCH -o /home/sbyfield/HapMap2Teo/slurm-log/stdout_log-%j.txt
#SBATCH -e /home/sbyfield/HapMap2Teo/slurm-log/stderr_log-%j.txt
#SBATCH -J teo
#SBATCH --cpus-per-task=18
echo "Starting jog:"
date
angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -doMaf 1 -uniqueOnly 0 -doMajorMinor 1 -doSaf 2 -out teoFolded -anc /home/sbyfield/HapMap2Teo/Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -fold 1 -P 18 -minMapQ 30 -minQ 20 -indF InbreedingCoEff.txt
realSFS teoFolded.saf 15 -maxIter 100 -P 18 > teoFolded.sfs
angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -out teoFolded.sfs -doThetas 1 -doSaf 2 -fold 1 -pest teoFolded.sfs -anc Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -P 18 -minMapQ 30 -minQ 20 -indF InbreedingCoEff.txt -doMajorMinor 1 -doMaf 1
thetaStat make_bed teoFolded.sfs.thetas.gz
thetaStat do_stat teoFolded.sfs.thetas.gz -nChr 15 -win 5000 -step 1000 -outnames teoFolded.thetasWindow -P 18
echo "done"
date